Can a Web crawler efficiently locate an unknown relevant page?
While this question is receiving much empirical attention due to its
considerable commercial value in the search engine community,
theoretical efforts to bound the performance of focused navigation
have only exploited the link structure of the Web graph, neglecting
other features. Here I investigate the connection between
linkage and a content-induced topology of Web pages, suggesting
that efficient paths can be discovered by decentralized navigation
algorithms based on textual cues.



*This work is funded in part by NSF CAREER Grant No. IIS-0133124. 
Topic-driven crawlers are increasingly seen as a way to address the
scalability limitations of universal search engines, by distributing the
crawling process across users, queries, or even client computers. The context
available to such crawlers can guide the navigation of links with the goal
of efficiently locating highly relevant target pages. Given the need to find
unknown target pages, we are only interested in decentralized crawling
algorithms, which can only use information available locally about a page and
its neighborhood. Starting from some source Web page, we aim to visit a
target page by navigating a path of length $\ell < N$, where $\ell$ is the number of
pages visited along the path and $N$ is the total number of pages.

Since the Web is a small-world network, we know that the diameter (at
least for the largest connected component) scales logarithmically with $N$;
therefore some short path exists between the source and target nodes such
that $\ell \sim \log N$. Can a crawler navigate such a short path? If the only local
information available is about the hypertext link degree of each node and its
neighbors, then simple greedy algorithms that always pick the neighbor with
highest degree lead to paths where the number of links traversed $\ell'$ scales
sublinearly ($\ell' \sim N^{\beta}$, $\beta < 1$) or logarithmically ($\ell' \sim \log N$). However, a
real Web crawler would have to visit all the neighbors of a node to determine
their degree, and moving to high-degree nodes makes such a strategy useless
for actual Web navigation ($\ell(\ell') \sim N$) given the power-law degree sequence.
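
To make the cost argument concrete, the following is a minimal sketch of a greedy highest-degree walk that also counts how many pages a crawler would have to fetch along the way. The adjacency-list representation, the function name `greedy_degree_walk`, and the stopping rule are assumptions made for illustration, not the algorithms analyzed in the cited work.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]  # hypothetical adjacency list: page -> linked pages


def greedy_degree_walk(
    graph: Graph, source: str, target: str, max_steps: int = 1000
) -> Optional[Tuple[int, int]]:
    """Walk greedily toward high-degree neighbors.

    Returns (links_traversed, pages_fetched), or None if the walk gets stuck.
    """
    visited = {source}
    fetched = {source}
    current = source
    steps = 0
    while steps < max_steps:
        neighbors = graph[current]
        # A crawler only learns a neighbor's degree by fetching that neighbor.
        fetched.update(neighbors)
        if target in neighbors:
            return steps + 1, len(fetched)
        candidates = [n for n in neighbors if n not in visited]
        if not candidates:
            return None
        # Greedy rule: move to the unvisited neighbor with the highest degree.
        current = max(candidates, key=lambda n: len(graph[n]))
        visited.add(current)
        steps += 1
    return None


toy_graph: Graph = {
    "a": ["hub", "x"],
    "hub": ["a", "x", "y", "t"],
    "x": ["a", "hub"],
    "y": ["hub"],
    "t": ["hub"],
}
links_traversed, pages_fetched = greedy_degree_walk(toy_graph, "a", "t")
```

In this toy graph the walk traverses only two links but fetches every page, which is the gap between $\ell'$ and the real crawling cost that the paragraph above describes.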



Kleinberg showed that if information about the geographic location of
nodes is available, then a greedy algorithm that always picks the neighbor
geographically closest to the target can yield $\ell \sim (\log N)^2$ if the link topology
follows a $D$-dimensional lattice, with $2D$ local links to lattice neighbors plus
one long range connection per node, linking to another node chosen
with probability $\Pr(r) \sim r^{-\alpha}$, where $r$ is the lattice distance between the two
nodes and $\alpha$ is a constant clustering exponent. In this model the optimal
path length is achieved for a critical clustering exponent dependent on the
dimensionality of the lattice ($\alpha = D$).
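
The lattice model can be sketched in a few lines for $D = 2$. This is an illustrative simulation under stated assumptions: a bounded (non-wrapping) grid, so border nodes have fewer than $2D$ local links, Manhattan distance as the lattice distance, and hypothetical helper names throughout.

```python
import random
from typing import Dict, List, Tuple

Node = Tuple[int, int]


def lattice_distance(a: Node, b: Node) -> int:
    """Manhattan distance on the grid (assumed lattice metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_long_links(n: int, alpha: float, rng: random.Random) -> Dict[Node, Node]:
    """Give each node one long range link, chosen with probability ~ r^(-alpha)."""
    nodes = [(i, j) for i in range(n) for j in range(n)]
    links: Dict[Node, Node] = {}
    for u in nodes:
        others = [v for v in nodes if v != u]
        weights = [lattice_distance(u, v) ** -alpha for v in others]
        links[u] = rng.choices(others, weights=weights, k=1)[0]
    return links


def local_neighbors(u: Node, n: int) -> List[Node]:
    """Lattice neighbors of u that fall inside the n x n grid."""
    i, j = u
    candidates = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    return [(x, y) for x, y in candidates if 0 <= x < n and 0 <= y < n]


def greedy_route(source: Node, target: Node, n: int, long_links: Dict[Node, Node]) -> int:
    """Always step to the known neighbor closest to the target; return the step count."""
    current, steps = source, 0
    while current != target:
        options = local_neighbors(current, n) + [long_links[current]]
        current = min(options, key=lambda v: lattice_distance(v, target))
        steps += 1
    return steps


rng = random.Random(0)
long_links = build_long_links(8, alpha=2.0, rng=rng)  # alpha = D = 2
path_length = greedy_route((0, 0), (7, 7), 8, long_links)
```

Because some local neighbor always lies one step closer to the target, the greedy walk terminates in at most the lattice distance between source and target; long range links can only shorten it.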

Kleinberg's model is inspired by social small-world networks where
geographical knowledge exists, but in the Web hypertext the notion of
geography is of little relevance and the lattice model is unrealistic. However, a
more relevant topological distance metric can be defined in the Web, namely 
the distance induced by the lexical similarity between the textual content of 
pages. Let us define such a lexical distance 
where $(p_1, p_2)$ is a pair of Web pages and $s$ is the cosine similarity function
traditionally used in information retrieval. The distance metric $r$ is a natural
local cue readily available in the Web, with the target content specified by a
query or topic of interest to the user. This metric also does not suffer from
the dimensionality bias that makes L-norms inappropriate in the sparse word
vector space.
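
As an illustration, a lexical distance can be computed from cosine similarity as follows. This is a sketch under two loudly stated assumptions: pages are represented by raw term counts (no stemming or TF-IDF weighting), and the distance takes the form $r = 1/s - 1$, chosen here only because it is zero for identical content and unbounded for unrelated content; it should not be read as the definition used in the text.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-count vectors (assumed representation)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def lexical_distance(text_a: str, text_b: str) -> float:
    """Assumed form r = 1/s - 1: zero for identical pages, infinite for disjoint ones."""
    s = cosine_similarity(text_a, text_b)
    return math.inf if s == 0 else 1.0 / s - 1.0


r_same = lexical_distance("web crawler", "web crawler")
r_related = lexical_distance("web crawler", "web page")
r_unrelated = lexical_distance("web crawler", "cooking recipes")
```

A crawler could apply the same function between a candidate page and the query or topic description, since both are just bags of words.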

To investigate the relationship between the lexical topology induced by
$r$ and the link topology, I measured the frequency of linked pairs of pages
as a function of the lexical distance, $\Pr(r(p_1, p_2) = \rho)$,
where the linkage between two pages was approximated by the overlap
of their neighborhoods, and $U_p$ is the URL set representing $p$'s neighborhood (inlinks, outlinks, and
$p$ itself). The threshold $\lambda$ models the ratio of local versus long range links,
analogous to Kleinberg's dimensionality $D$. Larger $\lambda$ values imply
sparser long range connections.
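
A thresholded neighborhood overlap of this kind might be sketched as below. The Jaccard-style ratio (intersection over union) and the strict comparison against the threshold are assumptions of this sketch, as are the function names; only the neighborhood definition (inlinks, outlinks, and the page itself) comes from the text.

```python
from typing import Iterable, Set


def neighborhood(page: str, inlinks: Iterable[str], outlinks: Iterable[str]) -> Set[str]:
    """URL set U_p: inlinks, outlinks, and the page itself."""
    return set(inlinks) | set(outlinks) | {page}


def overlap(u_a: Set[str], u_b: Set[str]) -> float:
    """Assumed overlap measure: |U_a & U_b| / |U_a | U_b| (Jaccard coefficient)."""
    union = u_a | u_b
    return len(u_a & u_b) / len(union) if union else 0.0


def is_linked(u_a: Set[str], u_b: Set[str], threshold: float) -> bool:
    """Treat two pages as linked when their overlap exceeds the threshold."""
    return overlap(u_a, u_b) > threshold


u1 = neighborhood("p1", inlinks=["a"], outlinks=["b", "c"])
u2 = neighborhood("p2", inlinks=["b"], outlinks=["c", "d"])
shared = overlap(u1, u2)  # 2 shared URLs out of 6 distinct ones
```

Raising the threshold can only remove pairs from the linked set, which matches the observation that larger threshold values leave sparser long range connections.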

Figure 1 shows that long range links are indeed distributed according
to a power law $\Pr(\rho|\lambda) \sim \rho^{-\alpha(\lambda)}$, as in Kleinberg's model, but using lexical
distance. A fit of the tails to the power law model reveals that the clustering
exponent $\alpha$ grows linearly with the linkage threshold $\lambda$ (inset).
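
One common way to estimate such an exponent is an ordinary least-squares line through the tail of the distribution in log-log space, sketched below. The cutoff parameter `tail_start` and the synthetic data are assumptions for illustration; the exact fitting procedure and cutoff behind the reported exponents are not specified here.

```python
import math
from typing import Sequence, Tuple


def fit_power_law_tail(
    distances: Sequence[float], probabilities: Sequence[float], tail_start: float
) -> Tuple[float, float]:
    """Fit Pr(rho) ~ c * rho^(-alpha) on the tail; return (alpha, c)."""
    # Keep only tail points with positive coordinates so the logs are defined.
    points = [
        (math.log(r), math.log(p))
        for r, p in zip(distances, probabilities)
        if r >= tail_start and r > 0 and p > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two tail points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # slope of log Pr versus log rho
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic check: data generated from an exact power law with alpha = 2.5.
rhos = [1.0, 2.0, 4.0, 8.0, 16.0]
probs = [3.0 * r ** -2.5 for r in rhos]
alpha_hat, c_hat = fit_power_law_tail(rhos, probs, tail_start=2.0)
```

Repeating the fit for each threshold value would give the points needed to plot the exponent against the threshold, as in the inset.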

Such a surprising result makes Kleinberg's analysis applicable to the 
Web. Crawlers that exploit textual cues may be able to navigate through 
short paths and locate unknown relevant pages. This is encouraging for the 
community of crawling algorithm designers. The actual bound on the length 
of the path $\ell$ depends on whether the clustering exponent is near a critical
value; this will have to be determined by extending Kleinberg's analysis to 
the lexical topology of the Web. 

The observed relationship between link and lexical Web topology has 
another interesting implication. One of the recent attempts to explain the 
power law distribution of Web degree sequences is based on a preferential 
attachment model in which new nodes are linked to existing ones based on 
a critical mixture of linkage bias (attach to a node within a few links from 
most other nodes) and geographic bias (attach to a node within a small 
Euclidean distance), where nodes are given random coordinates in the unit 
square. The present result could lead to a more realistic interpretation
whereby authors would link their new pages to sites that are both popular 
and related in content, i.e., central in link space and nearby in lexical space. 
Figure 1: Linkage probability $\Pr(\rho)$ as a function of lexical distance $\rho$, for
various values of the linkage threshold $\lambda$. The frequency data is based on
approximately $6 \times 10^8$ pairs of Web pages sampled from the Open Directory
(dmoz.org). The least-squares fit of the tail of each distribution to the
power law model $\Pr(\rho) \sim \rho^{-\alpha}$ is also shown. The inset plots the relationship
between the effective dimensionality induced by the linkage threshold $\lambda$ and
the clustering exponent $\alpha$ of the power law tail.